Ford GoBike Data Exploration

by Mohammad Hanafy

Preliminary Wrangling

This document explores a data set containing information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. The raw data set contains 183,412 records.

What is the structure of your dataset?

After cleaning the null values, the dataset contains 174,952 rows/entries and 16 columns/features: ('duration_sec', 'start_time', 'end_time', 'start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type', 'member_birth_year', 'member_gender', 'bike_share_for_all_trip').

2 datetime features: ('start_time', 'end_time')

1 boolean feature: ('bike_share_for_all_trip')

4 string/object features: ('start_station_name', 'end_station_name', 'user_type', 'member_gender')

9 numerical features: ('duration_sec', 'start_station_id', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_latitude', 'end_station_longitude', 'bike_id', 'member_birth_year')
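The cleaning and type conversion described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual preprocessing code: the helper name `clean_trips`, the tiny sample frame, and the assumption that the raw flag column holds "Yes"/"No" strings are all hypothetical.

```python
import pandas as pd


def clean_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and cast columns to the listed types."""
    trips = raw.dropna().copy()
    # Parse the two timestamp columns into datetime values.
    for col in ("start_time", "end_time"):
        trips[col] = pd.to_datetime(trips[col])
    # Assumption: the raw flag is stored as the strings "Yes" / "No".
    trips["bike_share_for_all_trip"] = trips["bike_share_for_all_trip"].eq("Yes")
    return trips


# Made-up sample rows, only to show the function running.
sample = pd.DataFrame({
    "duration_sec": [600, 350, 1200],
    "start_time": ["2019-02-28 17:32:10", "2019-02-28 08:05:00", "2019-02-27 12:00:00"],
    "end_time": ["2019-02-28 17:42:10", "2019-02-28 08:10:50", "2019-02-27 12:20:00"],
    "member_birth_year": [1984.0, None, 1990.0],
    "bike_share_for_all_trip": ["No", "Yes", "Yes"],
})
cleaned = clean_trips(sample)
print(cleaned.dtypes)
```

The row with the missing birth year is dropped, which mirrors how the record count shrinks after null values are removed.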

What is/are the main feature(s) of interest in your dataset?

I am interested in how long the average trip takes and which factors most affect trip duration.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

User type, gender, and age should be very helpful, as well as the start and end stations.

Some preprocessing on the data

Univariate Exploration

Distribution of the trips' durations

The plot shows that the duration values are concentrated between 0 and 4000 seconds, with a peak at around 350 seconds.

Distribution of the member age

The plot shows that the age values are concentrated between 20 and 40 years.

Distribution of the gender

The plot shows that the majority of the members are male.

Distribution of the customer type

The plot shows that the majority of the customers are subscribers.

Distribution of the trips' distance

The plot shows that the distance values are concentrated between 200 and 2000 meters.

Distribution of start/end hours

The plot shows that most trips start and end around 8 AM and 5 PM (the start and end of the working day).
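As a rough sketch of how the start and end hours might be derived from the timestamp columns, the example below uses the pandas `.dt.hour` accessor. The column names `start_hour` and `end_hour` and the sample timestamps are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Made-up timestamps, only to illustrate the hour extraction.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 08:05:00", "2019-02-01 08:40:00", "2019-02-01 17:15:00",
    ]),
    "end_time": pd.to_datetime([
        "2019-02-01 08:20:00", "2019-02-01 09:02:00", "2019-02-01 17:31:00",
    ]),
})

# Hour of day (0-23) in which each trip started and ended.
trips["start_hour"] = trips["start_time"].dt.hour
trips["end_hour"] = trips["end_time"].dt.hour

# Trips per hour, ordered by hour, ready for a bar chart.
start_counts = trips["start_hour"].value_counts().sort_index()
end_counts = trips["end_hour"].value_counts().sort_index()
print(start_counts)
```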

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The trip duration values are concentrated between 0 and 4000 seconds, with a peak at around 350 seconds. The original histogram was highly skewed to the right, so a log scale was used to make the shape of the distribution easier to read.
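One common way to apply a log scale to a right-skewed histogram is to build bin edges that are evenly spaced in log10 units. The sketch below assumes that approach; the helper `log_bins`, the step of 0.1, and the sample durations are hypothetical and may differ from what the notebook actually used.

```python
import numpy as np


def log_bins(values, step=0.1):
    """Return bin edges evenly spaced in log10 units that cover `values`."""
    values = np.asarray(values, dtype=float)
    lo = np.floor(np.log10(values.min()) / step) * step
    hi = np.ceil(np.log10(values.max()) / step) * step
    return 10 ** np.arange(lo, hi + step, step)


# Made-up durations in seconds, skewed to the right.
durations = np.array([61, 180, 350, 360, 420, 900, 4000])
edges = log_bins(durations)

# Count how many trips fall into each logarithmic bin.
counts, _ = np.histogram(durations, bins=edges)
print(counts.sum())
```

With matplotlib, the same `edges` can be passed as `bins=` to `plt.hist`, followed by `plt.xscale('log')` so the bars appear with equal width.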

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of the start hour and end hour were very close to each other: both peak at 8 AM and 5 PM, the start and end of the working day. The age histogram showed peaks at 26 and 31 years; age was calculated by subtracting the member's birth year from the year the data was collected (2019). Finally, the distance distribution showed that the values are concentrated between 200 and 2000 meters; trip distance was calculated from the latitude and longitude coordinates of the start and end stations.
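The two derived features could be computed along the following lines. The text does not say which distance formula was used, so the haversine (great-circle) formula here is an assumption, as are the function name `haversine_m`, the new column names, and the sample coordinates.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
DATASET_YEAR = 2019         # year of data collection, as stated in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Made-up rows using the dataset's column names.
trips = pd.DataFrame({
    "member_birth_year": [1984, 1990],
    "start_station_latitude": [37.7896, 37.7766],
    "start_station_longitude": [-122.4008, -122.3955],
    "end_station_latitude": [37.7946, 37.7766],
    "end_station_longitude": [-122.4008, -122.3955],
})

trips["age"] = DATASET_YEAR - trips["member_birth_year"]
trips["distance_m"] = haversine_m(
    trips["start_station_latitude"], trips["start_station_longitude"],
    trips["end_station_latitude"], trips["end_station_longitude"],
)
print(trips[["age", "distance_m"]])
```

Note that a trip that starts and ends at the same station gets a distance of zero under this definition, regardless of how far the bike actually travelled.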

Bivariate Exploration

Duration vs Age

The plot shows that members between 20 and 50 years old take longer trips than older members.

Duration vs Distance

The above scatter plot shows that duration is strongly related to distance.
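A visual impression like this can be backed by a correlation coefficient. The sketch below computes the Pearson correlation with `Series.corr`; the column name `distance_m` and all values are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Made-up distances (meters) and durations (seconds).
trips = pd.DataFrame({
    "distance_m": [500, 1000, 1500, 2000, 3000],
    "duration_sec": [200, 380, 610, 790, 1250],
})

# Pearson correlation: values near 1 indicate a strong linear relationship.
r = trips["duration_sec"].corr(trips["distance_m"])
print(round(r, 3))
```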

Duration vs Gender

The duration vs gender box plot shows that trip duration is distributed similarly across all genders.

Duration vs User Type

The duration vs user type box plot shows that trips are longer for customers than for subscribers.
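The comparison a box plot makes can also be summarized numerically by grouping on `user_type`. This is a small sketch with made-up durations, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up trips using the dataset's column names.
trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [900, 1500, 2100, 300, 500, 700, 900],
})

# Trip count and median duration per user type.
summary = trips.groupby("user_type")["duration_sec"].agg(["count", "median"])
print(summary)
```

The median is used rather than the mean because trip durations are strongly right-skewed.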

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Trip duration depends strongly on the member's age: members between 20 and 50 years old take longer trips than older members. Duration is also strongly related to distance. Finally, customers take longer trips than subscribers.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Gender did not seem to have much effect on the duration of the trips.

Multivariate Exploration

Duration, Age and Gender

The duration, age, and gender scatter plot shows that female members between the ages of 20 and 50 took more and longer trips than members of other genders.

Duration, Age and User Type

The above plot shows that, among older members, subscribers have longer trip durations than customers.
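An interaction between age and user type can be checked with a two-way table of median durations. The sketch below assumes an `age` column already exists and uses two hypothetical age bands; the data is made up to show the mechanics, not to reproduce the notebook's result.

```python
import pandas as pd

# Made-up trips; `age` is assumed to have been derived beforehand.
trips = pd.DataFrame({
    "age": [25, 34, 45, 58, 62, 27, 66, 38],
    "user_type": ["Customer", "Subscriber", "Customer", "Subscriber",
                  "Customer", "Subscriber", "Subscriber", "Customer"],
    "duration_sec": [1400, 520, 1100, 640, 600, 480, 700, 1250],
})

# Hypothetical age bands: [20, 50) and [50, 100).
trips["age_band"] = pd.cut(
    trips["age"], bins=[20, 50, 100], labels=["20-49", "50+"], right=False
)

# Median duration for every (age band, user type) combination.
table = (
    trips.groupby(["age_band", "user_type"], observed=True)["duration_sec"]
    .median()
    .unstack()
)
print(table)
```

Reading across a row compares the two user types within one age band, which is the comparison the plot is making.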

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the previous plots, we know that the majority of users are male, but now we see that female and other-gender members affect the trip durations more.

Were there any interesting or surprising interactions between features?

Among older members, subscribers have longer trip durations than customers.